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Abstract 

Statistical NLP systems are fre- 
quently evaluated and compared on 
the basis of their performances on 
a single split of training and test 
data. Results obtained using a single 
split are, however, subject to sam- 
pling noise. In this paper we ar- 
gue in favour of reporting a distri- 
bution of performance figures, ob- 
tained by resampling the training 
data, rather than a single num- 
ber. The additional information 
from distributions can be used to 
make statistically quantified state- 
ments about differences across pa- 
rameter settings, systems, and cor- 
pora. 

1 Introduction 

The common practice in evaluating statistical 
NLP systems is using a standard corpus (e.g., 
Penn TreeBank for parsing, Reuters for text 
categorization) along with a standard split be- 
tween training and test data. As systems im- 
prove, it becomes harder to achieve additional 
improvements, and the performance of vari- 
ous state-of-the-art systems is approximately 
identical. This makes performance compar- 
isons difficult. 

In this paper, we argue in favour of studying 
the distribution of performance, and present 
conclusions drawn from studying the recall 
distribution. This distribution provides mea- 
sures for answering the following questions: 

Ql: Comparing systems on given data: Is 



classifier A better than classifier B for 
given training and test data? 

Q2: Adequacy of training data to test data: 
Is a system trained on dataset X ade- 
quate for analysing dataset Yl Are fea- 
tures from X indicative in Yl 

Q3: Comparing data sets with a given sys- 
tem: If a different training set improves 
the result of system A on dataset Y\ , will 
this be the case on dataset Yi as well? 



The answers to these questions can provide 
useful insight into statistical NLP systems. 
In particular, about sensitivity to features in 
the training data, and transferability. These 
properties can be different even when similar 
performance is reported. 

A statistical treatment of Question 1 is 
presented by [Yeh (2000| ). He tests for the 
significance of performance differences on 
fixed training and test data sets. In other 
related works, Martin and Hirschbcrg (1996) 
provides an overview of significance tests 
of error differences in small samples, and 



Dietterich (1998 ) discusses results of a num- 
ber of tests. 

Questions 2 and 3 have been frequently 
raised in NLP, but not explicitly addressed, 
since the prevailing evaluation methods pro- 
vide no means of addressing them. In this 
paper we propose addressing all three ques- 
tions with a single experimental methodology, 
which uses the distribution of recall. 

2 Motivation 

Words, parts-of-speech (POS), words, or any 
feature in text may be regarded as outcomes 



of a statistical process. Therefore, word 
counts, count ratios, and other data used in 
creating statistical NLP models are statisti- 
cal quantities as well, and as such prone to 
sampling noise. Sampling noise results from 
the finiteness of the data, and the particular 
choice of training and test data. 

A model is an approximation or a more ab- 
stract representation of training data. One 
may look at a model collection of es- 

timators analogous, e.g., to the slope calcu- 
lated by linear regression. These estimators 
are statistics with a distribution related to the 
way they were obtained, which may be very 
complicated. The performance figures, being 
dependent on these estimators, have a distri- 
bution function which may be difficult to find 
theoretically. This distribution gives rise to 
intrinsic noise. 

Performance comparisons based on a single 
run or a few runs do not take these noises into 
account. Because we cannot assign the result- 
ing statements a confidence measure, they are 
more qualitative than quantitative. The de- 
gree to which we can accept such statements 
depends on the noise level and more generally, 
on the distribution of performance. 

In this paper, we use recall as a perfor- 



mance measure (cf. Section 4.4 and Section 
3.2 in ( |Yeh, 2000) )). Recall samples are ob- 
tained by resampling from training data and 
training classifiers on these samples. 

The resampling methods used here are 
cross-validation and bootstrap (Efron and 
Gong, 1983] ; |Efron and Tibshirani, 1993; , 
cf. Section ||). Section |I| presents the experi- 
mental goals and setup. Results are presented 
and discussed in Section ||, and a summary is 
provided in Section ||. 

3 The Bootstrap Method 

The bootstrap is a re-sampling technique 
designed for obtaining empirical distribu- 
tions of estimators. It can be thought 
of smoothed version of /c-fold cross- 

validation (CV). The method has been ap- 
plied to decision tree and bayesian classifiers 
by Kohavi (1995|) and to neural n etworks by, 
e.g., LeBaron and Weigend (19981) , 



In this paper, we use the bootstrap method 
to obtain the distribution of performance of a 
system which learns to identify non-recursive 
noun-phrases (base-NPs). While there are a 
few refinements of the method, the intention 
of this paper is to present the benefits of ob- 
taining distributions, rather than optimising 
bias or variance. We do not aim to study the 
properties of bootstrap estimation. 

Let a statistic S = S(x\, . . . , x n ) be a func- 
tion of the independent observations {xj}™ =1 
of a statistical variable X. The bootstrap 
method constructs the distribution function 
of S by successively re-sampling x with re- 
placements. 

After B samples, we have a set of bootstrap 
samples {x\, . . . , x h n }f =1 , each of which yields 
an estimate S b for S. The distribution of S is 
the bootstrap estimate for the distribution of 
S. That distribution is mostly used for esti- 
mating the standard deviation, bias, or confi- 
dence interval of S. 

In the present work, xi are the base-NP in- 
stances in a given corpus, and the statistic S 
is the recall on a test set. 

4 Experimental Setup 

The aim of our experiments is to test whether 
the recall distribution can be helpful in an- 
swering the questions Q1-Q3 mentioned in 
the introduction of this paper. 

The data and learning algorithms are pre- 



sented in Sections 4.1 and 4.2. Section 4.3 



describes the sampling method in detail. Sec- 



tion 4.4 motivates the use of recall and de- 



scribes the experiments. 



4.1 Data 



We used Penn-Treebank (Marcus et al., 1993) 
data, presented in Table |l| Wall-Street Jour- 
nal (WSJ) Sections 15-18 and 20 were used 
by |Ramshaw and Marcus (1995| ) as training 
and test data respectively for evaluating their 
base-NP chunker. These data have since be- 
come a standard for evaluating base-NP sys- 
tems. 

The WSJ texts are economic newspaper 
reports, which often include elaborated sen- 
tences containing about six base-NPs on the 



bource 


Sentences 


TIT 1 

Words 


Base 








NPs 


WhJ 15-18 


8936 


onn coo 

229598 


54760 


WSJ 20 


2012 


51401 


12335 


ATIS 


190 


2046 


613 


WSJ 20a 


100 


2479 


614 


WSJ 20b 


93 


2661 


619 



Table 1: Data sources 

average. 

The ATIS data, on the other hand, are 
a collection of customer requests related to 
flight schedules. These typically include short 
sentences which contain only three base-NPs 
on the average. For example: 

I have a friend living in Denver 
that would like to visit me 
here in Washington DC . 

The structure of sentences in the ATIS data 
differs significantly from that in the WSJ 
data. We expect this difference to be reflected 
in the recall of systems tested on both data 
sets. 

The small size of the ATIS data can influ- 
ence the results as well. To distinguish the 
size effect from the structural differences, we 
drew two equally small samples from WSJ 
Section 20. These samples, WSJ20a and 
WSJ20b, consist of the first 100 and the fol- 
lowing 93 sentences respectively. There is a 
slight difference in size because sentences were 



kept complete, as explained Section [4.3 . 

4.2 Learning Algorithms 

We evaluated base-NP learning systems based 



on two algorithms: MBSL (Argamon et al., 
1999|) and SNoW flMuhoz et al., 19991 ). 



MBSL is a memory-based system which 
records, for each POS sequence containing a 
border (left, right, or both) of a base-NP, the 
number of times it appears with that border 
vs. the number of times it appears without 
it. It is possible to set an upper limit on the 
length of the POS sequences. 

Given a sentence, represented by a sequence 
of POS tags, the system examines each sub- 
sequence for being a base-NP. This is done 
by attempting to tile it using POS sequences 



that appeared in the training data with the 
base-NP borders at the same locations. 

For the purpose of the present work, suffice 
it to mention that one of the parameters is 
the context size(c). It denotes the maximal 
number of words considered before or after a 
base-NP when recording sub-sequences con- 
taining a border. 

SNoW QRoth, 1998| , "Sparse Network of 
Winnow") is a network architecture of Win- 
now classifiers ( [Littlestone, 198*8 ) . Winnow 



is a mistake-driven algorithm for learning a 
linear separator, in which feature weights are 
updated by multiplication. The Winnow al- 
gorithm is known for being able to learn well 
even in the presence of many noisy features. 

The features consist of one to four consec- 
utive POSs in a 3-word window around each 
POS. Each word is classified as a beginning of 
a base-NP, as an end, or neither. 

4.3 Sampling Method 

In generating the training samples we sampled 
complete sentences. In MBSL, an un-marked 
boundary may be counted as a negative ex- 
ample for the POS-subsequences which con- 
tains it. Therefore, sampling only part of the 
base-NPs in a sentence will generate negative 
examples. 

For SNoW, each word is an example, but 
most of the words are neither a beginning nor 
an end of a base-NP. Random sampling of 
words might generate a sample with an im- 
proper balance between the three classes. 

To avoid these problems, we sampled 
full sentences instead of words or instances. 
Within a good approximation, it can be as- 
sumed that base-NP patterns in a sentence do 
not correlate. The base-NP instances drawn 
from the sampled sentences can therefore be 
regarded as independent. 

As described at the end of Sec. 4.1, the 



WSJ20a and WSJ20b data were created so 
that they contain 613 instances, like the ATIS 
data. In practice, the number of instances 
exceeds 613 slightly due to the full-sentence 
constraint. For the purpose of this work, it is 
enough that their size is very close to the size 
of ATIS. 



Dataset 


Sentences 


Base-NPs 


Training 
Unique: 


8938 ± 48 
5648 ± 34 


54763 ± 2 



Table 2: Sentence and instant counts for the 
bootstrap samples. The second line refers to 
unique sentences in the training data. 

We used the WSJ15-18 dataset for train- 
ing. This dataset contains uq = 54760 base- 
NP instances. The number of instances in a 
bootstrap sample depends on the number of 
instances in the last sampled sentence. As 
Table [2] shows, it is slightly more than uq. 

For fc-CV sampling, the data were divided 
into k random distinct parts, each containing 
^±2 instances. Table ^ shows the number of 
recall samples in each experiment (MBSL and 
SNoW experiments were carried out seper- 
ately) . 



Method 


MBSL 


SNoW 


Bootstrap 

CV (total folds) 


2200 
1500 


1000 
1000 



Table 3: Number of bootstrap samples and 
total CV folds. 



4.4 Experiments 

We trained SNoW and MBSL; the latter us- 
ing context sizes of c=l and c=3. Data sets 
WSJ20, ATIS, WSJ20a, and WSJ20b were 
used for testing. MBSL runs with the two 
c values were conducted on the same training 
samples, therefore it is possible to compare 
their results directly. 

Each run yielded recall and precision. Re- 
call may be viewed as the expected 0-1 loss- 
function on the given test sample of instances. 
Precision, on the other hand, may be viewed 
as the expected 0-1 loss on the sample of in- 
stances detected by the learning system. Care 
should be taken when discussing the distribu- 
tion of precision values because this sample 
varies from run to run. We will therefore only 
analyse the distribution of recall in this work. 

In the following, r 1 and r 3 denote recall 
samples of MBSL with c = 1 and c = 3, 
with standard deviations a 1 and a 3 , p 13 de- 



notes the cross-correlation between r 1 and r 3 . 
SNoW recall and standard deviation will be 
denoted by r SN and a . 

To approach the questions raised in the in- 
troduction we made the following measure- 
ments: 

Ql: System comparison was addressed by 
comparing r 1 and r 3 on the same test data. 
With samples at hand, we obtained an esti- 
mate of P(r 3 > r 1 ). 

Q2: We studied training and test adequacy 
through the effect of more specific features on 
recall, and on its standard deviation. 

Setting c = 3 takes into account sequences 
with context of two and three words in ad- 
dition to those with c = 1. Sequences with 
larger context are more specific, and an im- 
provement in recall implies that they are in- 
formative in the test data as well. 

For particular choices of parameters and 
test data, the recall spread yields an estimate 
of the training sampling noise. On inade- 
quate data, where the statistics differ signif- 
icantly from those in the training data, even 
small changes in the model can lead to a no- 
ticeable difference in recall. This is because 
the model relies on statistics which appear 
relatively rarely in the test data. Not only 
do these statistics provide little information 
about the problem, but even small differences 
in weighting them are relatively influential. 

Therefore, the more training and test data 
differ from each other, the more spread we can 
expect in results. 

Q3: For comparing test data sets with a 
system, we used cross-correlations between r , 
r 3 , or r SN samples obtained on these data sets. 
We know that WSJ data are different from 
ATIS data, and so expect the results on WSJ 
to correlate with ATIS results less than with 
other WSJ results. 

5 Results and Discussion 

For each of the five test datasets, Table Q re- 
ports averages and standard deviations of r , 
r 3 , and r SN obtained by 3, 5, 10, and 20-fold 
cross-validation, and by bootstrap, p 13 and 
P(r 3 > r 1 ) are reported as well. 

We discuss our results by considering to 



what extent they provide information for an- 
swering the three questions: 

Ql Comparing systems on given data: 

For the WSJ data sets, the difference between 
r 3 and r 1 was well above their standard de- 
viations, and r 3 > r 1 nearly always. For 
ATIS, the standard deviation of the differ- 
ence (a%_ rl = (a 1 ) 2 + (a 3 ) 2 - 2aV 3 • p 13 ) 
was small due to the high p 13 , and r 1 > r 3 
nearly always. 

Q2 — The adequacy of training and test 
sets: It is clear that adding more specific 
features, by increasing the context, improved 
recall on the WSJ test data and degraded it 
on the ATIS data. This is likely to be an indi- 
cation of the difference in syntactic structure 
between ATIS and WSJ texts. 

Another evidence of structural difference 
comes from standard deviations. The spread 
of the ATIS results always exceeded that of 
the WSJ results, with all three experiments. 
That difference cannot be solely attributed 
to the small size of ATIS, since WSJ20a 
and WSJ20b results displayed a much smaller 
spread. Indeed, these results had a wider 
standard deviation than WSJ20, probably 
due to the smaller size, but not as wide as 
ATIS. This indicates that base-NPs in ATIS 
text have different characteristics than those 
in WSJ texts. 

Q3 — Comparing datasets by a system: 

Table |5] reports, for each pair of datasets, the 
correlation between the 5-fold CV recall sam- 
ples of each experiment on these datasets. 
The correlations change with CV fold num- 
ber, 5-fold results were chosen as they repre- 
sent intermediary values. 

Both MBSL experiments yielded negligible 
correlations of ATIS results with any WSJ 
data set, whether large or small. These corre- 
lations were always weaker than with WSJ20a 
and WSJ20b, which are about the same size. 

This is due to ATIS being a different kind 
of text. The correlation between WSJ20a and 
WSJ20b results was also weak. This may be 
due to their small sizes; these texts might not 
share enough features to make a significant 



correlation. 

SNoW results were highly correlated for all 
pairs. That behaviour is markedly different 
from the MBSL results, and indicates a high 
level of noise in the SNoW features. Indeed, 
Winnow is able to learn well in the presence 
of noise, but that noise causes the high corre- 
lations observed here. 

5.1 Further Observations 

The decrease of p 13 with CV fold number is 
related to stabilization of the system. As the 
folds become larger, training samples become 
more similar to each other, and the spread of 
results decreases. This effect was not visible 
in the SNoW data, most likely due to the high 
level of noise in the features. This noise also 
contributes to the higher standard deviation 
of SNoW results. 

6 Summary and Further Research 

In this work, we used the distribution of re- 
call to address questions concerning base-NP 
learning systems and corpora. Two of these 
questions, of training and test adequacy, and 
of comparing data sets using NLP systems, 
were not addressed before. 

The recall distributions were obtained using 
CV and bootstrap resampling. 

We found differences between algorithms 
with similar recall, related to the features they 
use. 

We demonstrated that using an inadequate 
test set may lead to noisy performance results. 
This effect was observed with two different 
learning algorithms. We also reported a case 
when changing a parameter of a learning al- 
gorithm improved results on one dataset but 
degraded results on another. 

We used classifiers as "similarity rulers", 
for producing a similarity measure between 
datasets. Classifiers may have various prop- 
erties as similarity rulers, even when their re- 
calls are similar. Each classifier should be 
scaled differently according to its noise level. 
This demonstrates the way we can use clas- 
sifiers to study data, as well as use data to 
study classifiers. 



Test 


Method 


MBSL 


SNoW 


data 




E(r l ) ±a l 


E(r A ) ± a 6 




P{r A > r 1 ) 


V / 




3-CV 


89.64 ± 0.16 


91.26 ± 0.12 


0.36 


100% 


90.18 ± 1.01 




5-CV 


89.75 ± 0.14 


91.43 ± 0.10 


0.30 


100% 


90.37 ± 1.03 


WSJ 20 


10-CV 


89.80 ± 0.12 


91.53 ± 0.08 


0.25 


100% 


90.47 ± 1.11 




20-CV 


89.81 ± 0.11 


91.56 ± 0.07 


0.28 


100% 


90.51 ± 1.19 




Bootstrap 


89.58 ± 0.17 


91.16 ± 0.14 


0.42 


100% 


89.83 ± 0.93 




E(-) 


89.74 


91.58 




91.23 




3-CV 


85.70 ± 2.03 


83.99 ± 1.87 


0.82 


3% 


83.70 ± 4.11 




5-CV 


85.76 ± 1.87 


83.69 ± 1.57 


0.79 


1% 


83.53 ± 4.52 


ATIS 


10-CV 


85.90 ± 1.31 


84.78 ± 0.92 


0.78 


4% 


83.38 ± 5.14 




20-CV 


85.78 ± 1.16 


83.28 ± 0.85 


0.77 


0% 


83.23 ± 5.36 




Bootstrap 


85.72 ± 1.95 


84.69 ± 1.95 


0.81 


16% 


83.50 ± 3.35 




E(-) 


85.81 


83.20 




85.48 




3-CV 


89.45 ± 0.42 


91.25 ± 0.56 


0.33 


100% 


90.84 ± 1.04 




5-CV 


89.66 ± 0.36 


91.64 ± 0.54 


0.32 


100% 


91.07 ± 1.15 


WSJ 20a 


10-CV 


89.79 ± 0.28 


91.85 ± 0.49 


0.20 


100% 


91.14 ± 1.26 




20-CV 


89.82 ± 0.23 


91.89 ± 0.44 


0.18 


100% 


91.11 ± 1.39 




Bootstrap 


89.42 ± 0.47 


91.55 ± 0.57 


0.33 


99% 


90.76 ± 1.00 




E(.) 


89.73 


92.18 




90.07 




3-CV 


88.95 ± 0.41 


90.12 ± 0.39 


0.37 


99% 


89.79 ± 0.81 




5-CV 


89.03 ± 0.36 


90.15 ± 0.31 


0.31 


99% 


89.81 ± 0.84 


WSJ 20b 


10-CV 


89.06 ± 0.33 


90.14 ± 0.22 


0.28 


99% 


89.83 ± 0.86 




20-CV 


89.07 ± 0.27 


90.13 ± 0.18 


0.22 


100% 


89.87 ± 0.88 




Bootstrap 


89.00 ± 0.44 


90.17 ± 0.44 


0.38 


98% 


89.93 ± 0.80 




E{.) 


89.01 


91.55 




90.79 



Table 4: Recall statistic summary for MBSL with contexts c = 1 and c = 3, and SNoW. The 
E(-) figures were obtained using the full training set. Note the monotonic change of standard 
deviation with fold number. The s.d. of the bootstrap samples are closest to those of low-fold 
CV samples. 



5-CV 


WSJ 20b 


WSJ 20a 






3 






3 


r SN 


WSJ 20 


0.33 


0.19 


0.72 


0.26 


0.29 


0.78 


ATIS 


-0.01 


0.00 


0.59 


0.02 


-0.01 


0.63 


WSJ 20a 


0.07 


0.04 


0.59 





ATIS 

^3 



„SN 



0.08 0.02 0.76 



Table 5: Cross-correlations between recalls of the three experiments on the test data for five-fold 
CV. Correlations of r 1 capture dataset similarity in the best way. 



By using MBSL with different context sizes, 
our results provide insights into the relation 
between training and test data sets, in terms 
of general and specific features. That issue be- 
comes important when one plans to use a sys- 
tem trained on certain data set for analysing 
an arbitrary text. Another approach to this 
topic, examining the effect of using lexical 
bigram information, which is very corpus- 
specific, appears in flGildea, 2001 ). 



In our experiments with systems trained on 
WSJ data, there was a clear difference be- 
tween their behaviour on other WSJ data and 
on the ATIS data set, in which the structure 
of base-NPs is different. That difference was 
observed with correlations and standard devi- 
ations. This shows that resampling the train- 
ing data is essential for noticing these struc- 
ture differences. 

To control the effect of small size of the 
ATIS dataset, we provided two equally-small 
WSJ data sets. The effect of different genres 
was stronger than that of the small-size. 

In future study, it would be helpful to study 
the distribution of recall using training and 
test data from a few genres, across genres, 
and on combinations (e.g. "known-similarity 
corpora" ( Kilgarriff and Rose, 1998| )). This 
will provide a measure of the transferability 
of a model. 

We would like to study whether there is a 
relation between bootstrap and 2 or 3-CV re- 
sults. The average number of unique base- 
NPs in a random bootstrap training sample 
is about 63% of the total training instances 
(Table ||). That corresponds roughly to the 
size of a 3-CV training sample. More work is 
required to see whether this relation between 
bootstrap and low-fold CV is meaningful. 

We also plan to study the distribution of 



precision. As mentioned in Sec. |4.4| , the pre- 
cisions of different runs are now taken from 
different sample spaces. This makes the boot- 
strap estimator unsuitable, and more study is 
required to overcome this problem. 
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